Abstract


Nonparametric methods have been increasingly explored in Bayesian hierarchical
modeling as a way to increase model flexibility. Although the field shows
a lot of promise, inference in many models, including Hierarchical Dirichlet
Processes (HDP), remains prohibitively slow. One promising path forward is to
exploit the submodularity inherent in the Indian Buffet Process (IBP) to derive
near-optimal solutions in polynomial time. In this work, I will present a brief tutorial
on Bayesian nonparametric methods, especially as they are applied to topic
modeling. I will show a comparison between different non-parametric models and the
current state-of-the-art parametric model, Latent Dirichlet Allocation (LDA).


1 Research Goals: 


My goals at the outset of this experiment were to:

• Learn more about nonparametric models 

• Investigate different non-parametric topic models and compare them to an LDA process 
run on a New York Times dataset with topics k = 55. 

• Pick higher-performing algorithms and apply recent research in Indian Buffet Process
submodularity to these algorithms to assess speed increase.

I should note that my experiments were not successful. The non-parametric topic modeling
algorithms that I compared to LDA had consistently worse perplexity scores. Thus, I did not go forward
with the proposed implementation of submodular methods. This speaks less to the potential of
nonparametric models than to the need for more work in the field and the need to lower the barrier to entry:
I found that unifying introductory texts were few and far between.

Thus, I will treat this paper as the start of a tutorial I plan to build out in the next few months. While 
I am certainly not an expert, I approached the subject as a beginner just a few weeks ago. This paper 
gives me an opportunity to offer a beginner’s look at non-parametrics, one that I will expand on. 
Although I feel that I have not contributed in a significant way research-wise, if I expand this paper 
over the next few months to give a better high-level introduction to nonparametric modeling, this 
can be an important contribution. 

I have tried to make my emphasis in this paper orthogonal to my presentation. That is, I will go
lightly on the topics that I covered heavily in my presentation and go more deeply into
the background of processes.
2 Introduction


Bayesian hierarchical techniques present unified, flexible and consistent methods for modeling
real-world problems. In this section, we will review parametric models by using the Latent Dirichlet
Allocation model (LDA), presented by Blei et al. (2003) [1], as an example. We will then introduce
Bayesian Processes and show how they factor into non-parametric models.

2.1 Bayesian Parametric Models: Latent Dirichlet Allocation (LDA) 



The LDA topic model is a prototypical generative Bayesian model. It is “generative” in the sense
that it assumes that some real-world observation (in this case, word counts) is generated by a series
of draws from underlying distributions. According to Blei’s hypothesis, the data are generated
as follows:

For each document $w$ in corpus $D$:

1. Choose $N \sim \mathrm{Poisson}(\xi)$

2. Choose $\theta \sim \mathrm{Dir}(\alpha)$

3. For each of the $N$ words $w_n$:

• Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$

• Choose a word $w_n \sim \mathrm{Multinomial}(\beta_{z_n})$
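
The generative steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the Poisson rate name `xi`, and the toy topic-word table `beta` are hypothetical, and the Dirichlet draw is implemented by normalizing Gamma draws so that only the standard library is needed.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, xi, rng):
    """Generate one document as a list of (topic index, word index) pairs.

    alpha: Dirichlet parameter, one entry per topic
    beta:  beta[k][v] is the probability of word v under topic k
    xi:    Poisson rate controlling the document length (assumed name)
    """
    n_words = sample_poisson(xi, rng)          # step 1: document length N
    theta = sample_dirichlet(alpha, rng)       # step 2: topic proportions
    document = []
    for _ in range(n_words):                   # step 3: one draw per word
        z = sample_categorical(theta, rng)     # topic z_n
        w = sample_categorical(beta[z], rng)   # word w_n given topic z_n
        document.append((z, w))
    return document
```

For example, `generate_document([0.5, 0.5], [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]], 20.0, random.Random(0))` produces a short document whose words come only from the vocabulary of their assigned topic.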

These underlying distributions are hidden from us when we observe our sample set (the corpus), 
but through various methods we can infer their parameterization. This allows us to predict future 
samples, compress information, and explain existing samples in useful ways. 

LDA is a mixture model: it assumes that each word is assigned a class, or “topic”, and that each
document is represented by a mixture of these topics. The overall number of topics in a corpus,
k, must be fixed in advance by the user, and a suitable value is often difficult to determine. To overcome this
limitation, we introduce non-parametric models.

2.2 Bayesian Nonparametric Models: An Introduction 

Gaussian Processes: Function Modeling 

We start our discussion of nonparametrics with Gaussian Processes, since their construction
follows the way in which the processes we actually use will be constructed, yet I feel
Gaussian-anything is a naturally easier concept to grasp. Let’s say $X$ is distributed according to a Gaussian
Process with mean measure $m(x)$ and covariance $k(x, x')$:

$X \sim \mathcal{GP}(m(x), k(x, x'))$

Mean and covariance measures are simply functions that take some possibly infinite subset
$x$ of Euclidean (or, more generally, Hilbert) space and return a measure. When using
processes in Bayesian nonparametrics, it is most helpful to think of a “measure” as our prior belief, expressed not as a
parameter but as a mapping function.
In other words, much like a function maps an input space to a quantity, measures are used to quantify
sets. For example, let’s take the mean measure. In the Gaussian Process, $m(x) = \mathbb{E}[f(x)]$.
This maps a finite (or infinite) set $x$ to an expected value. In practice, when we have no information
about $f(x)$, it’s common to choose $m = 0$ as a suitable measure prior. (This is reflected in Figure
2(a), which shows our prior beliefs about the GP, where the span of the process is given by the gray
region, with mean 0 and constant variance.)

Over a finite subset of Euclidean space, $x$, the measure takes an expected, real value, and the Gaussian
Process reduces to a finite-dimensional distribution: a joint Gaussian. For example, if $x_1, x_2$ are subsets of the
real line, then $y_1, y_2$ are distributed as:


$(y_1, y_2) \sim \mathcal{GP}(m(\{x_1, x_2\}), k(\{x_1, x_2\}, \{x_1, x_2\}')) = \mathcal{N}(\mu, \Sigma)$

This joint Gaussian property of the GP, over an infinite subspace, can describe the set of points
forming functions, shown in Figure 2. We can envision starting with a finite number of discrete
subsets of the real line, represented as the blue points in the diagram, and continuing to add samples.
As the number of subsets we add approaches infinity, we have encompassed the entire real line,
and created a joint Gaussian whose realization forms a continuous function. This ability of the GP
to model functions allows it to be used in kernel-based non-linear regression as a prior. (For more
information, see [3].)
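
To make the finite-dimensional view concrete, here is a minimal sketch of drawing a Gaussian Process prior sample at a handful of input points, which amounts to one draw from a joint Gaussian. The squared-exponential kernel, the zero mean, the small diagonal jitter, and all function names are assumptions chosen for illustration; they are not taken from [3].

```python
import math
import random


def squared_exponential(x1, x2, length_scale=1.0):
    """Covariance k(x, x') between two scalar inputs (assumed kernel choice)."""
    return math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def covariance_matrix(xs, kernel, jitter=1e-8):
    """Build the covariance matrix over the finite input set xs.

    A small jitter is added to the diagonal for numerical stability.
    """
    n = len(xs)
    return [[kernel(xs[i], xs[j]) + (jitter if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def cholesky(matrix):
    """Return the lower-triangular factor L with L L^T = matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_gp_prior(xs, kernel, rng, mean=lambda x: 0.0):
    """Draw function values at xs from N(mean(xs), K(xs, xs))."""
    lower = cholesky(covariance_matrix(xs, kernel))
    noise = [rng.gauss(0.0, 1.0) for _ in xs]
    return [mean(x) + sum(lower[i][k] * noise[k] for k in range(i + 1))
            for i, x in enumerate(xs)]
```

Calling `sample_gp_prior([-2.0, -1.0, 0.0, 1.0, 2.0], squared_exponential, random.Random(1))` returns five correlated values; adding more input points fills in the sampled function more densely.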



[Figure 2: panels (a) prior and (b) posterior; horizontal axis: input, x]

Figure 2: Gaussian Process used to describe the function space. (a) Prior, with $\mathcal{GP}(0, \Sigma)$. The
blue, dotted line represents the GP over bounded subspaces while the red and blue lines represent the
unbounded cases. (b) Posterior, updated to fit observed data. [3] (Rasmussen, 2006)

Dirichlet Processes: Probability Modeling 

Similarly to how the Gaussian Process can be applied over infinite subspaces to model functions, 
the Dirichlet Process can model continuous probability densities. This is also rather intuitive. 

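Let’s start with a finite set, $A$. If $\Pi$ is a partition of $A$ such that $\bigcup_{k=1}^{K} \Pi_k = A$ and $\Pi_k \cap \Pi_l = \emptyset$ for
$k \neq l$, then:

$(G(\Pi_1), \ldots, G(\Pi_K)) \sim \mathrm{Dir}(\alpha H(\Pi_1), \ldots, \alpha H(\Pi_K))$

where $G$ is drawn from a Dirichlet Process ($\mathcal{DP}$) with base measure $H$ and concentration parameter
$\alpha$:


$G \sim \mathcal{DP}(\alpha, H)$, where $\mathbb{E}[G(\Pi_k)] = H(\Pi_k)$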

We can think of the base measure $H$ in the Dirichlet Process the same way as we think of the mean
measure in the GP, simply as a mapping from subsets of Hilbert space to reals, in this case reals
$\in [0, 1]$.

Now, using a Dirichlet Process to model the space of probability distributions makes sense
when we consider the special property of the Dirichlet distribution that a sample from the
distribution is a tuple that sums to 1, essentially a probability mass function. Thus, a Dirichlet Process
over infinitely many partitions gives a probability distribution over a continuous space much the same way that
sampling from a GP over infinitely many subsets of the real line gives a continuous function.
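
The partition property can be checked numerically. The sketch below assumes a base measure with total mass one, summarized by its masses on a fixed partition (the list `base_masses` and the function names are hypothetical), and draws the corresponding Dirichlet vector by normalizing Gamma draws.

```python
import random


def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma(param, 1) draws."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def dp_partition_masses(alpha, base_masses, rng):
    """Draw (G(Pi_1), ..., G(Pi_K)) for one partition of the space.

    base_masses[k] plays the role of H(Pi_k); the entries are assumed to sum to 1.
    """
    return sample_dirichlet([alpha * h for h in base_masses], rng)


def average_masses(alpha, base_masses, n_draws, rng):
    """Monte Carlo estimate of E[G(Pi_k)] for each block of the partition."""
    sums = [0.0] * len(base_masses)
    for _ in range(n_draws):
        for k, mass in enumerate(dp_partition_masses(alpha, base_masses, rng)):
            sums[k] += mass
    return [s / n_draws for s in sums]
```

Each draw sums to one, and averaging many draws, for instance `average_masses(5.0, [0.2, 0.3, 0.5], 4000, random.Random(0))`, should land close to the base masses, while a larger `alpha` concentrates individual draws more tightly around them.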

The Dirichlet Process has found many uses in modeling probability distributions. One example is the
Hierarchical Dirichlet Process (HDP) by Yee Whye Teh [4]. This model uses two stacked Dirichlet
Processes. The first is sampled to provide the base distribution for the second, allowing a fluid
construction of point mixture modeling known as the “Chinese Restaurant Franchise”. However,
the HDP has noted flaws. In particular, because the HDP draws the class proportions from a
dataset-level joint distribution, it makes the assumption that the weight of a component in the entire dataset
(in our application, the “corpus”) is correlated with the proportion of that component being expressed
within a datapoint (a “document”). In other words, the probability of a datapoint exhibiting a
class is correlated with the weight of that class within the datapoint. Intuitively, in the topic modeling
context, we might argue that rare corpus-level topics are often expressed to a great degree within
specific documents, thus indicating that the HDP is flawed.

Beta Process: Point Event Modeling 

The Beta Process offers us a way to perform class assignment without this correlational bias. We 
will see more in the next section. 

Taking a step back, a draw from a Beta distribution is a special case of a Dirichlet distribution draw
over two classes. Similarly, the Beta Process is defined over the product space $\Omega \times [0, 1]$.

A draw from a Beta Process is given as:

$B = \sum_{k=1}^{\infty} p_k \delta_{\omega_k}$

where $B \sim \mathrm{BP}(c, B_0)$, and $\delta_{\omega_k}$ can be thought of as a unit measure at $\omega_k$. The $p_k$ and $\omega_k$ are defined by
the Lévy measure, $\nu$, of the Beta Process’ product space (given by Hjort (1990) [6]):


$\nu_{BP}(dp, d\omega) = c\, p^{-1} (1 - p)^{c-1}\, dp\, B_0(d\omega)$

which can be passed as the mean measure to a Poisson Process. (Here $c > 0$ is the concentration
parameter and $B_0$ is the continuous, finite base measure. This is commonly used in
Lévy-Khintchine formulations of stochastic processes.)

Therefore, we can see that BP draws need not sum to one. Taken over infinite subspaces, the Beta
Process can be used to model CDFs, as point measurements are $\in [0, 1]$, but draws are not dependent
on the sum of subspaces, as in the Dirichlet. Taking the aggregate of point measurements, we can
derive CDFs, which are especially useful in fields like survival analysis.

As we can see, the $\omega_j$ point measurements can also be used to model class assignments. Thus, Beta
Processes can effectively model each class membership separately. Per datapoint, the probability mass
behind $n$ subspaces, or classes, is no longer bounded by one as in the Dirichlet Process. (For more
on Beta Processes see Paisley [7] and Zhou [8].)


2.3 Bernoulli Processes and the Indian Buffet Process

Here we construct such a mixed-membership modeling scheme. The Beta Process is conjugate to
the Bernoulli family. (A Bernoulli process is a very simple stochastic process that can be thought
of simply as a sequence of coin flips with probability $p$.) This makes the Beta-Bernoulli pairing an
ideal candidate for mixed-membership applications like topic modeling, where each data point
is expressed as a combination of latent classes.
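
As a rough illustration of the Beta-Bernoulli pairing, the sketch below assumes a finite number of classes `n_classes`, draws one weight per class from Beta(alpha / n_classes, 1), and then flips one coin per datapoint and class. The function name and the particular Beta parameters are assumptions (a common finite approximation), not a definitive construction.

```python
import random


def sample_finite_beta_bernoulli(n_points, n_classes, alpha, rng):
    """Sample class weights and a binary membership matrix.

    Each class k gets its own weight pi_k in [0, 1]; the weights are drawn
    independently, so they are not constrained to sum to one.
    Z[i][k] = 1 means datapoint i expresses class k.
    """
    pis = [rng.betavariate(alpha / n_classes, 1.0) for _ in range(n_classes)]
    Z = [[1 if rng.random() < pi else 0 for pi in pis]
         for _ in range(n_points)]
    return pis, Z


def active_classes(Z):
    """Indices of classes used by at least one datapoint."""
    if not Z:
        return []
    return [k for k in range(len(Z[0])) if any(row[k] for row in Z)]
```

With, say, `sample_finite_beta_bernoulli(10, 50, 2.0, random.Random(0))`, most of the 50 classes stay unused and each datapoint switches on only a few of them, which is the behavior a mixed-membership model needs.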

Griffiths and Ghahramani [9] recognized this in 2005, and constructed the Indian Buffet Process (IBP)
by marginalizing out the Beta Process. As Thibaux and Jordan [10] would later show in 2007, given:

$B \sim \mathrm{BP}(c, B_0)$, $\{X_i \mid B\}_{i=1,\ldots,n} \sim \mathrm{BeP}(B)$
where BeP is the Bernoulli Process, the Beta Process can be marginalized out to give:

$X_{n+1} \mid X_1, \ldots, X_n \sim \mathrm{BeP}\left( \frac{c}{c+n} B_0 + \sum_{j} \frac{m_{n,j}}{c+n} \delta_{\omega_j} \right)$

where $B_0$ is a continuous base distribution with mass $B_0(\Omega) = \gamma$, $\delta_{\omega}$ is the unit point mass at $\omega$,
and $m_{n,j} = \sum_{i=1}^{n} X_i(\{\omega_j\})$ is the number of datapoints in $\omega_j$. Thus, we see through this marginalization
that the class labels $j$ are independent of order: as long as the order of $\omega_j$ is consistent across $X_n$,
the BeP construction remains the same. This realization can be used to prove exchangeability of
the beta-assigned class labels, proving that these points are a de Finetti mixture. (For more details,
see [10]. De Finetti mixtures are great because they can be realized in provably defined probability
distributions.)

If we model these processes as their corresponding distributions, we can derive a probability function.


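$\pi_k \mid \alpha \sim \mathrm{Beta}\left( \frac{\alpha}{K}, 1 \right)$

$z_{ik} \mid \pi_k \sim \mathrm{Bernoulli}(\pi_k)$

$P(Z) = \prod_{k=1}^{K} \int \left( \prod_{i=1}^{N} P(z_{ik} \mid \pi_k) \right) p(\pi_k)\, d\pi_k$

Figure 3: Graphical model for the Indian Buffet Process [9].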


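Where the probability of observing a set of class assignments $\{z_{i1}, \ldots, z_{iK}\}$ for each datapoint $i$ is
given by $P(Z)$. Under the exchangeability property of the Beta-Bernoulli construction given above,
the Indian Buffet Process can be expressed as the sum of infinite classes:

$P([Z]) = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!} \prod_{k=1}^{K} \int \left( \prod_{i=1}^{N} P(z_{ik} \mid \pi_k) \right) p(\pi_k)\, d\pi_k$

$P([Z] \mid \alpha, \beta) = \frac{(\alpha\beta)^{K_+}}{\prod_{h=1}^{2^N - 1} K_h!} \exp\left( -\alpha \sum_{j=1}^{N} \frac{\beta}{\beta + j - 1} \right) \prod_{k=1}^{K_+} \frac{\Gamma(m_k)\, \Gamma(N - m_k + \beta)}{\Gamma(N + \beta)}$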


Where $\Gamma$ represents the Gamma function, $\alpha$ and $\beta$ are concentration parameters, $m_k$ is the number of
datapoints assigned to class $k$, and $K_h$ is the number of classes sharing history $h$ under the left-ordered
form (lof) ordering. In addition to reordering the class labels in a way that sums
probability only over the “active” set $K_+$ of topics, this construction collapses multiple classes
assigned to the same datapoints into a single class label. For more details on the combinatorial
ordering scheme see slide 22 of my presentation, included in this folder. This formulation groups
the class labels of the IBP datapoints in a way that allows the probability distribution to remain
well-defined even as the set of classes is unbounded. Thus, we can draw inference on the infinite
parameter space through a finite set of observations.
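
For a concrete feel of this probability, the sketch below evaluates the log-probability of a small binary matrix. It assumes the standard two-parameter form of the IBP probability, written out in the docstring; the function name and the list-of-rows representation of the matrix are hypothetical.

```python
import math
from collections import Counter


def ibp_log_prob(Z, alpha, beta=1.0):
    """Log-probability of the equivalence class [Z] under a two-parameter IBP.

    Assumed formula:
      P([Z]) = (alpha*beta)^{K+} / prod_h K_h!
               * exp(-alpha * sum_{j=1..N} beta / (beta + j - 1))
               * prod_{k=1..K+} Gamma(m_k) Gamma(N - m_k + beta) / Gamma(N + beta)

    Z is a list of N rows of 0/1 entries; all-zero columns are ignored.
    """
    n = len(Z)
    n_cols = len(Z[0]) if n else 0
    columns = [tuple(Z[i][k] for i in range(n)) for k in range(n_cols)]
    active = [col for col in columns if any(col)]   # the K+ active classes
    k_plus = len(active)

    log_p = k_plus * math.log(alpha * beta)
    # K_h counts how many active columns share the same history (pattern).
    log_p -= sum(math.lgamma(count + 1) for count in Counter(active).values())
    log_p -= alpha * sum(beta / (beta + j - 1) for j in range(1, n + 1))
    for col in active:
        m_k = sum(col)  # number of datapoints using this class
        log_p += (math.lgamma(m_k) + math.lgamma(n - m_k + beta)
                  - math.lgamma(n + beta))
    return log_p
```

As a sanity check under these assumptions, a single datapoint with one active class and `beta = 1` gives `log(alpha) - alpha`, the log-probability that a Poisson(alpha) variable equals one, and permuting rows or columns of `Z` leaves the value unchanged.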
3 Applications of the Indian Buffet Process: Experiments

The Indian Buffet Process has been applied to a host of mixed-membership schemes. Like the Hierarchical
Dirichlet Process, it models class membership in a non-parametric way, thus making it appealing
for use in non-parametric Bayesian models. Moreover, because it escapes the correlation that the HDP
introduces, the IBP has recently been studied in topic modeling applications.

One simple example is the Focused Topic Model (Williamson et al., 2010) [11]. The Focused Topic Model
uses an IBP Compound Dirichlet Process, which effectively decouples high
corpus-level probability mass for a topic from high document-level probability mass. The topic
weights $\theta_i$ for document $i$ are now generated from a $\mathrm{Dir}(\phi \cdot b_i)$, where $b_i$ represents the
binary draw from an IBP. Thus, strong corpus-level topics may be “shut off” in the document.
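
The masking effect of the binary draw can be sketched as follows. This is a minimal illustration only: it assumes the corpus-level topic strengths `phi` and the document's binary vector `b` are already given (in the full model both are random), and the function name is hypothetical.

```python
import random


def sample_focused_topic_weights(phi, b, rng):
    """Draw document topic weights from a Dirichlet restricted to active topics.

    phi: positive corpus-level topic strengths, one per topic
    b:   binary vector for this document; b[k] == 0 switches topic k off
    Topics that are switched off receive exactly zero weight.
    """
    active = [k for k, bit in enumerate(b) if bit]
    if not active:
        raise ValueError("at least one topic must be active in the document")
    # Dirichlet draw over the active topics via normalized Gamma draws.
    draws = {k: rng.gammavariate(phi[k], 1.0) for k in active}
    total = sum(draws.values())
    return [draws[k] / total if k in draws else 0.0 for k in range(len(phi))]
```

For instance, with `phi = [10.0, 1.0, 1.0]` and `b = [0, 1, 1]`, the topic that is strongest at the corpus level gets zero weight in this document, and the remaining weights still sum to one.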

Archambeau et al. take this a step further in [12] by applying an IBP compound Dirichlet to both the
document topics $\theta_i$ and the words $\beta_w$. In other words, while the Focused Topic Model decouples the
corpus-level strength of a topic from the probability of a document expressing a topic, it still allows
words to be weighted strongly under high-probability topics, encouraging distributions across words
to be more uniform. Archambeau et al. add the IBP compound Dirichlet to word probabilities as
well, creating a doubly compounded model, which they call LiDA.

Mingyuan Zhou took a different approach [13]. Instead of directly incorporating the IBP, he built
a topic model around the Beta-Negative Binomial Process, a hierarchical topic model which is
related to the IBP (the IBP is a Beta-Bernoulli construction) but is not binary, and thus has greater
expressiveness. He presented his work at the 2014 NIPS conference.

Experiments: 

I ran trials for all three of these on a New York Times dataset of 5294 documents and 5065 words,
which I compared against a control of LDA with 55 topics. I chose this number as it was the
current 30-day window of articles that we use for training our recommendations system. I held out
10 percent of articles for a perplexity test. The log-perplexity of the held-out set was -7.9 for LDA on
55 topics, 2.099 for LiDA, which scaled to 66 topics, 5.6 for FTM and 5.4 for BNBP.
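
As a rough sketch of this kind of held-out evaluation, the code below assumes the common per-word definition of perplexity, computed from each held-out document's log-likelihood under a fitted model. The exact formula and sign convention used by each implementation may differ, and the function names are hypothetical.

```python
import math


def log_perplexity(doc_log_likelihoods, doc_lengths):
    """Negative average log-likelihood per token over the held-out documents.

    doc_log_likelihoods: natural-log likelihood of each held-out document
    doc_lengths:         number of tokens in each held-out document
    """
    total_tokens = sum(doc_lengths)
    if total_tokens == 0:
        raise ValueError("held-out set contains no tokens")
    return -sum(doc_log_likelihoods) / total_tokens


def perplexity(doc_log_likelihoods, doc_lengths):
    """Per-word perplexity; lower values mean better held-out prediction."""
    return math.exp(log_perplexity(doc_log_likelihoods, doc_lengths))
```

Under this definition, a model that spreads probability uniformly over a vocabulary of V words has perplexity V, which gives a useful baseline when comparing scores across models.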

4 Submodularity of the IBP 

Current inference algorithms for IBP-related processes involve either sequential Gibbs samplers or
sequential variational inference, effectively creating an NP-hard problem. Recent work by Reed and
Ghahramani [14] examining the IBP has focused on its submodularity properties. Although I explored
this in detail in my presentation, I will not go deeply into it here, as it is not relevant to the experiments reported above.
5 Conclusion

In this paper, I’ve tried to give a brief tutorial on some commonly used Bayesian nonparametrics, as
well as explain some motivations behind their use. I’ve tried to steer clear of the metaphors many
introductory reviews use to explain nonparametrics, like stick-breaking processes or Pólya urns, in
favor of more general explanations. I’ve reviewed some recent implementations of nonparametric
models and compared them on a single dataset.

Although the performance of the models I tested was generally lacking and the methods were slow,
I still feel that nonparametrics offer a sophisticated approach towards constructing flexible Bayesian
models. With more research, I’m confident that the field could produce some promising results.